Abstract

Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson 
paradigm to deal with asymmetric errors in binary classification with a convex loss. Given 
a finite collection of classifiers, we combine them and obtain a new classifier that satisfies 
simultaneously the two following properties with high probability: (i) its probability of type I 
error is below a pre-specified level and (ii), it has probability of type II error close to the 
minimum possible. The proposed classifier is obtained by solving an optimization problem 
with an empirical objective and an empirical constraint. New techniques to handle such 
problems are developed and have consequences on chance constrained programming. 

keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, stochastic 
constraint, convexity, empirical risk minimization, chance constrained optimization. 

1 Introduction 

The Neyman-Pearson (NP) paradigm in statistical learning extends the objective of classical 
binary classification in that, while the latter focuses on minimizing classification error that is 
a weighted sum of type I and type II errors, the former minimizes type II error subject to an upper
bound $\alpha$ on type I error. With slight abuse of language, in verbal discussion we do not distinguish
type I/II error from probability of type I/II error.

For learning with the NP paradigm, it is essential to avoid one kind of error at the expense 
of the other. As an illustration, consider the following problem in medical diagnosis: failing to 
detect a malignant tumor has far more severe consequences than flagging a benign tumor. Other 
scenarios include spam filtering, machine monitoring, target recognition, etc. 

In the learning context, as true errors are inaccessible, we cannot enforce almost surely the 
desired upper bound for type I error. The best we can hope is that a data dependent classifier 
has type I error bounded with high probability. Hence, there are two goals in this project.
The first is to design a learning procedure so that the type I error of the learned classifier $\hat h$ is upper
bounded by a pre-specified level with pre-specified high probability; the second is to show that
$\hat h$ has good performance bounds for excess type II error.

This paper is organized as follows. In Section 2, the classical setup for binary classification 
is reviewed and the main notation is introduced. A parallel between binary classification and 
statistical hypothesis testing is drawn in Section 3 with emphasis on the NP paradigm in both 
frameworks. The main propositions, theorems and their proofs are stated in Section 4 while 
secondary, technical results are relegated to the Appendix. Finally, Section 5 illustrates an 
application of our results to chance constrained optimization. 

In the rest of the paper, we denote by $x_j$ the $j$-th coordinate of a vector $x \in \mathbb{R}^d$.

2 Binary classification 

2.1 Classification risk and classifiers 

Let $(X, Y)$ be a random couple where $X \in \mathcal{X} \subset \mathbb{R}^d$ is a vector of covariates and $Y \in \{-1, 1\}$
is a label that indicates to which class $X$ belongs. A classifier $h$ is a mapping $h : \mathcal{X} \to [-1, 1]$
whose sign returns the predicted class given $X$. An error occurs when $-h(X)Y \ge 0$ and it is
therefore natural to define the classification loss by $\mathbf{1}(-h(X)Y \ge 0)$, where $\mathbf{1}(\cdot)$ denotes the
indicator function.

The expectation of the classification loss with respect to the joint distribution of $(X, Y)$ is
called (classification) risk and is defined by

\[ R(h) = \mathbb{P}(-h(X)Y \ge 0). \]

Clearly, the indicator function is not convex and for computation, a common practice is to 
replace it by a convex surrogate (see, e.g. Bartlett et al., 2006, and references therein). 
To this end, we rewrite the risk function as

\[ R(h) = \mathbb{E}[\varphi(-h(X)Y)], \tag{2.1} \]

where $\varphi(z) = \mathbf{1}(z \ge 0)$. Convex relaxation can be achieved by simply replacing the indicator
function by a convex surrogate.

Definition 2.1. A function $\varphi : [-1, 1] \to \mathbb{R}^+$ is called a convex surrogate if it is non-decreasing,
continuous and convex and if $\varphi(0) = 1$.

Commonly used examples of convex surrogates are the hinge loss $\varphi(x) = (1 + x)_+$, the logit
loss $\varphi(x) = \log_2(1 + e^x)$ and the exponential loss $\varphi(x) = e^x$.
For a given choice of $\varphi$, define the $\varphi$-risk

\[ R_\varphi(h) = \mathbb{E}[\varphi(-Yh(X))]. \]

Hereafter, we assume that $\varphi$ is fixed and refer to $R_\varphi$ as the risk. In our subsequent analysis, this
convex relaxation will also be the basis for analyzing a stochastic convex optimization problem
subject to stochastic constraints. A general treatment of such problems can be found in Section 5.
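The three surrogates named in this subsection can be checked numerically. The following is a minimal sketch in Python, not part of the original analysis: the function names, the toy sample and the toy classifier are hypothetical, and the classification loss is assumed to count the boundary case $z = 0$ as an error. It verifies that each surrogate takes the value 1 at 0 and dominates the indicator on $[-1, 1]$, so that the empirical $\varphi$-risk upper bounds the empirical classification risk.

```python
import math

def hinge(x):
    """Hinge surrogate (1 + x)_+."""
    return max(0.0, 1.0 + x)

def logit(x):
    """Logit surrogate log2(1 + e^x)."""
    return math.log2(1.0 + math.exp(x))

def exponential(x):
    """Exponential surrogate e^x."""
    return math.exp(x)

def indicator(x):
    """Classification loss 1(x >= 0); ties are assumed to count as errors."""
    return 1.0 if x >= 0 else 0.0

def phi_risk(phi, h, sample):
    """Empirical phi-risk: average of phi(-y * h(x)) over labeled pairs (x, y)."""
    return sum(phi(-y * h(x)) for x, y in sample) / len(sample)

# Hypothetical toy data: scalar covariates in [0, 1] with labels in {-1, 1}.
sample = [(0.2, 1), (0.7, -1), (0.9, 1), (0.4, -1)]

def h(x):
    """Hypothetical classifier mapping [0, 1] into [-1, 1]."""
    return 2.0 * x - 1.0

for phi in (hinge, logit, exponential):
    # Each surrogate risk is at least the empirical classification risk.
    print(phi.__name__, phi_risk(phi, h, sample), phi_risk(indicator, h, sample))
```

The domination property is what later allows a bound on the $\varphi$-type I error to carry over to the true type I error.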

Because of overfitting, it is unreasonable to look for mappings minimizing empirical risk over
all classifiers. Indeed, one could have a small empirical risk but a large true risk. Hence, we
resort to regularization. There are in general two ways to proceed. The first is to restrict the
candidate classifiers to a specific class $\mathcal{H}$, and the second is to change the objective function by,
for example, adding a penalty term. The two approaches can be combined, and sometimes are
obviously equivalent.

In this paper, we pursue the first idea by defining the class of candidate classifiers as follows.
Let $h_1, \dots, h_M$, $M \ge 2$ be a given collection of classifiers. In our setup, we allow $M$ to be
large. In particular, our results remain asymptotically meaningful as long as $M = o(e^n)$. Such
classifiers are usually called base classifiers and can be constructed in a very naive manner.
Typical examples include decision stumps or small trees. While the $h_j$'s may have no satisfactory
classifying power individually, for over two decades, boosting type of algorithms have successfully
exploited the idea that a suitable weighted majority vote among these classifiers may result in
low classification risk (Schapire, 1990). Consequently, we restrict our search for classifiers to the
set of functions consisting of convex combinations of the $h_j$'s:

\[ \mathcal{H}^{\mathrm{conv}} = \Big\{ h_\lambda = \sum_{j=1}^M \lambda_j h_j, \ \lambda \in \Lambda \Big\}, \]

where $\Lambda$ denotes the flat simplex of $\mathbb{R}^M$ and is defined by $\Lambda = \{\lambda \in \mathbb{R}^M : \lambda_j \ge 0, \ \sum_{j=1}^M \lambda_j = 1\}$.

In effect, classification rules given by the sign of $h \in \mathcal{H}^{\mathrm{conv}}$ are exactly the set of rules produced
by the weighted majority votes among the base classifiers $h_1, \dots, h_M$.
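To make the class of convex combinations concrete, here is a small Python sketch. It assumes scalar covariates and decision stumps as base classifiers; the helper names `stump` and `convex_combination` and the weights are hypothetical and chosen only for illustration. The combined classifier takes values in $[-1, 1]$ and its sign is the weighted majority vote.

```python
def stump(threshold, sign=1):
    """Decision stump on a scalar input: +sign if x > threshold, else -sign."""
    return lambda x: sign if x > threshold else -sign

def convex_combination(weights, base):
    """Return h_lambda = sum_j lambda_j * h_j after checking lambda is in the simplex."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to 1")
    return lambda x: sum(w * hj(x) for w, hj in zip(weights, base))

# Three hypothetical base classifiers and one point of the flat simplex.
base = [stump(0.3), stump(0.5), stump(0.8, sign=-1)]
weights = [0.5, 0.3, 0.2]
h_lam = convex_combination(weights, base)

for x in (0.1, 0.6, 0.9):
    # The sign of h_lam(x) is the weighted majority vote of the stumps at x.
    print(x, h_lam(x))
```

Because each base classifier takes values in $[-1, 1]$ and the weights lie in the simplex, no clipping is needed to keep $h_\lambda$ in $[-1, 1]$.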

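By restricting our search to classifiers in $\mathcal{H}^{\mathrm{conv}}$, the best attainable $\varphi$-risk is called oracle
risk and is abusively denoted by $R_\varphi(\mathcal{H}^{\mathrm{conv}})$. As a result, we have $R_\varphi(h) \ge R_\varphi(\mathcal{H}^{\mathrm{conv}})$ for any
$h \in \mathcal{H}^{\mathrm{conv}}$. A natural measure of performance for a classifier $h \in \mathcal{H}^{\mathrm{conv}}$ is given by its excess
risk defined by $R_\varphi(h) - R_\varphi(\mathcal{H}^{\mathrm{conv}})$.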

The excess risk of a data driven classifier $\hat h_n$ is a random quantity and we are interested
in bounding it with high probability. Formally, the statistical goal of binary classification is to
construct a classifier $\hat h_n$ such that the oracle inequality

\[ R_\varphi(\hat h_n) \le R_\varphi(\mathcal{H}^{\mathrm{conv}}) + \Delta_n(\mathcal{H}^{\mathrm{conv}}, \delta) \tag{2.2} \]

holds with probability $1 - \delta$, where $\Delta_n(\cdot, \cdot)$ should be as small as possible.

In the scope of this paper, we focus on candidate classifiers in the class $\mathcal{H}^{\mathrm{conv}}$. Some of the
following results such as Theorem 4.1 can be extended to more general classes of classifiers with
known complexity such as classes with bounded VC-dimension, as for example in Cannon et al.
(2002). However, our main argument for bounding type II error relies on Proposition 4.1 which,
in turn, depends heavily on the convexity of the problem, and it is not clear how it can be
extended to more general classes of classifiers.

2.2 The Neyman-Pearson paradigm 

In classical binary classification, the risk function can be expressed as a convex combination of
type I error $R^-(h) = \mathbb{P}(-Yh(X) \ge 0 \mid Y = -1)$ and of type II error $R^+(h) = \mathbb{P}(-Yh(X) \ge 0 \mid Y = 1)$:

\[ R(h) = \mathbb{P}(Y = -1) R^-(h) + \mathbb{P}(Y = 1) R^+(h). \]

More generally, we can define the $\varphi$-type I and $\varphi$-type II errors respectively by

\[ R^-_\varphi(h) = \mathbb{E}[\varphi(-Yh(X)) \mid Y = -1] \quad \text{and} \quad R^+_\varphi(h) = \mathbb{E}[\varphi(-Yh(X)) \mid Y = 1]. \]

Following the NP paradigm, for a given class $\mathcal{H}$ of classifiers, we seek to solve the constrained
minimization problem:

\[ \min_{h \in \mathcal{H},\ R^-_\varphi(h) \le \alpha} R^+_\varphi(h), \tag{2.3} \]

where $\alpha \in (0, 1)$, the significance level, is a constant specified by the user.

NP classification is closely related to the NP approach to statistical hypothesis testing. We
now recall a few key concepts about the latter. Many classical works have addressed the theory
of statistical hypothesis testing; in particular, Lehmann and Romano (2005) provide a thorough
treatment of the subject.

Statistical hypothesis testing bears strong resemblance to binary classification if we assume
the following model. Let $P^-$ and $P^+$ be two probability distributions on $\mathcal{X} \subset \mathbb{R}^d$. Let $p \in (0, 1)$
and assume that $Y$ is a random variable defined by

\[ Y = \begin{cases} 1 & \text{with probability } p, \\ -1 & \text{with probability } 1 - p. \end{cases} \]

Assume further that the conditional distribution of $X$ given $Y$ is given by $P^Y$. Given such a
model, the goal of statistical hypothesis testing is to determine whether $X$ was generated from
$P^-$ or $P^+$. To that end, we construct a test $\phi : \mathcal{X} \to [0, 1]$ and the conclusion of the test based on
$\phi$ is that $X$ is generated from $P^+$ with probability $\phi(X)$ and from $P^-$ with probability $1 - \phi(X)$.
Note that randomness here comes from an exogenous randomization process such as flipping a
biased coin. Two kinds of errors arise: type I error occurs when rejecting $P^-$ when it is true,
and type II error occurs when accepting $P^-$ when it is false. The Neyman-Pearson paradigm
in hypothesis testing amounts to choosing $\phi$ that solves the following constrained optimization
problem

\[ \text{maximize } \mathbb{E}[\phi(X) \mid Y = 1], \quad \text{subject to } \mathbb{E}[\phi(X) \mid Y = -1] \le \alpha, \]

where $\alpha \in (0, 1)$ is the significance level of the test. In other words, we specify a significance
level $\alpha$ on type I error, and minimize type II error. We call a solution to this problem a most
powerful test of level $\alpha$. The Neyman-Pearson Lemma gives mild sufficient conditions for the
existence of such a test.

Theorem 2.1 (Neyman-Pearson Lemma). Let $P^-$ and $P^+$ be probability distributions possessing
densities $p^-$ and $p^+$ respectively with respect to some measure $\mu$. Let $\phi_k(x) = \mathbf{1}(L(x) \ge k)$,
where the likelihood ratio $L(x) = p^+(x)/p^-(x)$ and $k$ is such that $P^-(L(X) > k) \le \alpha$ and
$P^-(L(X) \ge k) \ge \alpha$. Then,

• $\phi_k$ is a level $\alpha = \mathbb{E}[\phi_k(X) \mid Y = -1]$ most powerful test.

• For a given level $\alpha$, the most powerful test of level $\alpha$ is defined by

\[ \phi(X) = \begin{cases} 1 & \text{if } L(X) > k, \\ 0 & \text{if } L(X) < k, \\ \dfrac{\alpha - P^-(L(X) > k)}{P^-(L(X) = k)} & \text{if } L(X) = k. \end{cases} \]
Notice that in the learning framework, $\phi$ cannot be computed since it requires the knowledge
of the likelihood ratio and of the distributions $P^-$ and $P^+$. Therefore, it remains merely a
theoretical proposition. Nevertheless, the result motivates the NP paradigm pursued here.
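When both distributions are known and supported on finitely many points, the randomized test of the Neyman-Pearson Lemma can be computed directly. The sketch below is only an illustration under that assumption: the two toy distributions, the level and the function names `np_threshold` and `randomized_test` are hypothetical, and exact rational arithmetic is used so that the level is attained exactly.

```python
from fractions import Fraction

def np_threshold(p_minus, p_plus, alpha):
    """Find k with P^-(L > k) <= alpha <= P^-(L >= k) and the randomization
    probability gamma = (alpha - P^-(L > k)) / P^-(L = k).
    Assumes p_minus[x] > 0 for every outcome x and 0 < alpha < 1."""
    ratios = {x: p_plus[x] / p_minus[x] for x in p_minus}
    for k in sorted(set(ratios.values()), reverse=True):
        above = sum(p_minus[x] for x in ratios if ratios[x] > k)
        at = sum(p_minus[x] for x in ratios if ratios[x] == k)
        if above <= alpha <= above + at:
            return k, (alpha - above) / at
    raise ValueError("no threshold found")

def randomized_test(p_minus, p_plus, k, gamma):
    """phi(x): probability of concluding that x was generated from P^+."""
    phi = {}
    for x in p_minus:
        ratio = p_plus[x] / p_minus[x]
        phi[x] = Fraction(1) if ratio > k else (gamma if ratio == k else Fraction(0))
    return phi

# Hypothetical distributions on three outcomes.
p_minus = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
p_plus = {"a": Fraction(1, 10), "b": Fraction(3, 10), "c": Fraction(3, 5)}
alpha = Fraction(1, 4)

k, gamma = np_threshold(p_minus, p_plus, alpha)
phi = randomized_test(p_minus, p_plus, k, gamma)
size = sum(p_minus[x] * phi[x] for x in phi)    # type I error, equals alpha
power = sum(p_plus[x] * phi[x] for x in phi)    # one minus type II error
print(k, gamma, size, power)
```

In the learning setting none of these quantities is available, which is why the remainder of the paper works with empirical counterparts.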



3 Neyman-Pearson classification via convex optimization 

Recall that in NP classification, the goal is to solve the problem (2.3). This cannot be done
directly as the conditional distributions $P^-$ and $P^+$, and hence the corresponding type I and type II errors, are unknown. In
statistical applications, information about these distributions is available through two i.i.d.
samples $X_1^-, \dots, X_{n^-}^-$, $n^- \ge 1$ and $X_1^+, \dots, X_{n^+}^+$, $n^+ \ge 1$, where $X_i^- \sim P^-$, $i = 1, \dots, n^-$
and $X_i^+ \sim P^+$, $i = 1, \dots, n^+$. We do not assume that the two samples $(X_1^-, \dots, X_{n^-}^-)$ and
$(X_1^+, \dots, X_{n^+}^+)$ are mutually independent. Presently the sample sizes $n^-$ and $n^+$ are assumed to
be deterministic and will appear in the subsequent finite sample bounds. A different sampling
scheme, where these quantities are random, is investigated in subsection 4.3.



3.1 Previous results and new input 

While the binary classification problem has been extensively studied, theoretical propositions on
how to implement the NP paradigm remain scarce. To the best of our knowledge, Cannon et al.
(2002) initiated the theoretical treatment of the NP classification paradigm and an early empirical
study can be found in Casasent and Chen (2003). The framework of Cannon et al. (2002)
is the following. Fix a constant $\varepsilon_0 > 0$ and let $\mathcal{H}$ be a given set of classifiers with finite VC
dimension. They study a procedure that consists of solving the following relaxed empirical
optimization problem

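\[ \min_{h \in \mathcal{H},\ \hat R^-(h) \le \alpha + \varepsilon_0/2} \hat R^+(h), \tag{3.1} \]

where

\[ \hat R^-(h) = \frac{1}{n^-} \sum_{i=1}^{n^-} \mathbf{1}(h(X_i^-) \ge 0) \quad \text{and} \quad \hat R^+(h) = \frac{1}{n^+} \sum_{i=1}^{n^+} \mathbf{1}(h(X_i^+) \le 0) \]

denote the empirical type I and empirical type II errors respectively. Let $\hat h$ be a solution to
(3.1). Denote by $h^*$ a solution to the original Neyman-Pearson optimization problem:

\[ h^* \in \operatorname*{argmin}_{h \in \mathcal{H},\ R^-(h) \le \alpha} R^+(h). \tag{3.2} \]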

The main result of Cannon et al. (2002) states that, simultaneously with high probability, the
type II error $R^+(\hat h)$ is bounded from above by $R^+(h^*) + \varepsilon_1$, for some $\varepsilon_1 > 0$, and the type I error
of $\hat h$ is bounded from above by $\alpha + \varepsilon_0$. In a later paper, Cannon et al. (2003) consider problem
(3.1) for a data-dependent family of classifiers $\mathcal{H}$, and bound estimation errors accordingly.
Several results for traditional statistical learning such as PAC bounds or oracle inequalities have
been studied in Scott (2005) and Scott and Nowak (2005) in the same framework as the one laid
down by Cannon et al. (2002). A noteworthy departure from this setup is Scott (2007) where
sensible performance measures for NP classification that go beyond analyzing separately two
kinds of errors are introduced. Furthermore, Blanchard et al. (2010) develop a general solution
to semi-supervised novelty detection by reducing it to NP classification. Recently, Han et al.
(2008) transposed several results of Cannon et al. (2002) and Scott and Nowak (2005) to NP
classification with convex loss.

The present work departs from previous literature in its treatment of type I error. As a
matter of fact, the classifiers in all the papers mentioned above can only ensure that $\mathbb{P}(R^-(\hat h) > \alpha + \varepsilon_0)$
is small, for some $\varepsilon_0 > 0$. However, it is our primary interest to make sure that $R^-(\hat h) \le \alpha$
with high probability, following the original principle of the Neyman-Pearson paradigm that
type I error should be controlled by a pre-specified level $\alpha$. As will be illustrated, to control
$\mathbb{P}(R^-(\hat h) > \alpha)$, it is necessary to have $\hat h$ be a solution to some program with a strengthened
constraint on empirical type I error. If our concern is only on type I error, we can just do so.
However, we also want to control excess type II error simultaneously.

The difficulty was foreseen in the seminal paper Cannon et al. (2002), where it is claimed
without justification that if we use $\alpha' < \alpha$ for the empirical program, "it seems unlikely that
we can control the estimation error $R^+(\hat h) - R^+(h^*)$ in a distribution independent way". The
following proposition confirms this opinion in a certain sense.

Fix $\alpha \in (0, 1)$, $n^- \ge 1$, $n^+ \ge 1$ and $\alpha' < \alpha$. Let $\hat h(\alpha')$ be the classifier defined as any solution
of the following optimization problem:

\[ \min_{h \in \mathcal{H},\ \hat R^-(h) \le \alpha'} \hat R^+(h). \]

The following negative result holds not only for this estimator but also for the oracle $h^*(\alpha')$
defined as the solution of

\[ \min_{h \in \mathcal{H},\ R^-(h) \le \alpha'} R^+(h). \]

Note that $h^*(\alpha')$ is not a classifier but only a pseudo-classifier since it depends on the unknown
distribution of the data.

Proposition 3.1. There exist base classifiers $h_1, h_2$ and a probability distribution for $(X, Y)$
for which, regardless of the sample sizes $n^-$ and $n^+$, for any pseudo-classifier $h \in [h_1, h_2]$ such that
$R^-(h) < \alpha$, it holds

\[ R^+(h) - \min_{\lambda \in [0,1]} R^+(\lambda h_1 + (1 - \lambda) h_2) \ge \alpha > 0. \]

In particular, the excess type II risk of $h^*(\alpha - \varepsilon_{n^-})$, $\varepsilon_{n^-} > 0$, does not converge to zero as sample
sizes increase even if $\varepsilon_{n^-} \to 0$. Moreover, when $\alpha < 1/2$, for any (pseudo-)classifier $h \in [h_1, h_2]$
such that $\hat R^-(h) < \alpha$, it holds

\[ R^+(h) - \min_{\lambda \in [0,1]} R^+(\lambda h_1 + (1 - \lambda) h_2) \ge \alpha > 0 \tag{3.3} \]

with probability at least $\alpha \wedge 1/4$. In particular, the excess type II risk of $\hat h(\alpha - \varepsilon_{n^-})$, $\varepsilon_{n^-} > 0$,
does not converge to zero with positive probability, as sample sizes increase even if $\varepsilon_{n^-} \to 0$.

The proof of this result is postponed to the appendix. The fact that the oracle $h^*(\alpha - \varepsilon_{n^-})$
satisfies the lower bound indicates that the problem comes from using a strengthened constraint.
Note that the condition $\alpha < 1/2$ is purely technical and can be removed. Nevertheless, it is
always the case in practice that $\alpha < 1/2$.

In view of this negative result, it seems that our rightful insistence on controlling type I error does not go
well with the ambition to control type II error simultaneously. To overcome this dilemma, we
resort to a continuous convex surrogate as our loss function. In particular, we design a modified
version of the empirical risk minimization method such that the data-driven classifier $\hat h$ has type I
error bounded by $\alpha$ with high probability. Moreover, we consider here a class $\mathcal{H}$ that allows a
different treatment of the empirical processes involved.

This new approach comes with new technical challenges which we summarize here. In the
approach of Cannon et al. (2002) and of Scott and Nowak (2005), the relaxed constraint on the
type I error is constructed such that the constraint $\hat R^-(h) \le \alpha + \varepsilon_0/2$ on type I error in (3.1)
is satisfied by $h^*$ (defined in (3.2)) with high probability, and that this classifier accommodates
excess type II error well. As a result, the control of type II error mainly follows as a standard
exercise to control suprema of empirical processes. This is not the case here; we have to develop
methods to control the optimum value of a convex optimization problem under a stochastic
constraint. Such methods have consequences not only in NP classification but also on chance
constrained programming as explained in Section 5.



3.2 Convexified NP classifier 

To solve the problem of NP classification (2.3) where the distribution of the observations is 
unknown, we resort to empirical risk minimization. In view of the arguments presented in the 
previous subsection, we cannot simply replace the unknown true risk functions by their empirical 
counterparts. The treatment of the convex constraint should be done carefully and we proceed 
as follows. 

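For any classifier $h$ and a given convex surrogate $\varphi$, define $\hat R^-_\varphi$ and $\hat R^+_\varphi$ to be the empirical
counterparts of $R^-_\varphi$ and $R^+_\varphi$ respectively by

\[ \hat R^-_\varphi(h) = \frac{1}{n^-} \sum_{i=1}^{n^-} \varphi(h(X_i^-)) \quad \text{and} \quad \hat R^+_\varphi(h) = \frac{1}{n^+} \sum_{i=1}^{n^+} \varphi(-h(X_i^+)). \]

Moreover, for any $a > 0$, let $\mathcal{H}^{\varphi,a} = \{h \in \mathcal{H}^{\mathrm{conv}} : R^-_\varphi(h) \le a\}$ be the set of classifiers
in $\mathcal{H}^{\mathrm{conv}}$ whose convexified type I errors are bounded from above by $a$, and let
$\mathcal{H}^{\varphi,a}_{n^-} = \{h \in \mathcal{H}^{\mathrm{conv}} : \hat R^-_\varphi(h) \le a\}$ be the set of classifiers in $\mathcal{H}^{\mathrm{conv}}$ whose empirical convexified type I errors
are bounded by $a$. To make our analysis meaningful, we assume that $\mathcal{H}^{\varphi,\alpha} \ne \emptyset$.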

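We are now in a position to construct a classifier in $\mathcal{H}^{\mathrm{conv}}$ according to the Neyman-Pearson
paradigm. For any $\tau > 0$ such that $\tau < \alpha\sqrt{n^-}$, define the convexified NP classifier $\tilde h^\tau$ as any
classifier that solves the following optimization problem

\[ \min_{h \in \mathcal{H}^{\mathrm{conv}},\ \hat R^-_\varphi(h) \le \alpha - \tau/\sqrt{n^-}} \hat R^+_\varphi(h). \tag{3.4} \]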
Note that this problem consists of minimizing a convex function subject to a convex constraint
and can therefore be solved by standard algorithms (see, e.g., Boyd and Vandenberghe,
2004, and references therein).
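As a rough illustration of problem (3.4), the sketch below treats the special case of two base classifiers, where the simplex reduces to a single weight in $[0, 1]$, and replaces a convex solver by a plain grid search. This is a simplification chosen for readability, not the procedure analyzed in the paper. The hinge surrogate, the toy samples, the values of `alpha` and `tau`, and all function names are hypothetical; the empirical errors are assumed to be the averages of $\varphi(h(X_i^-))$ and $\varphi(-h(X_i^+))$.

```python
import math

def hinge(x):
    """Hinge surrogate (1 + x)_+."""
    return max(0.0, 1.0 + x)

def emp_type1(phi, h, neg):
    """Empirical phi-type I error: average of phi(h(x)) over the negative sample."""
    return sum(phi(h(x)) for x in neg) / len(neg)

def emp_type2(phi, h, pos):
    """Empirical phi-type II error: average of phi(-h(x)) over the positive sample."""
    return sum(phi(-h(x)) for x in pos) / len(pos)

def convexified_np(h1, h2, neg, pos, alpha, tau, phi=hinge, grid=1001):
    """Grid search over h = lam*h1 + (1-lam)*h2 for the smallest empirical
    phi-type II error subject to emp_type1 <= alpha - tau/sqrt(n^-).
    Returns (lam, risk), or None when no grid point is feasible."""
    level = alpha - tau / math.sqrt(len(neg))
    best = None
    for i in range(grid):
        lam = i / (grid - 1)
        h = lambda x, lam=lam: lam * h1(x) + (1 - lam) * h2(x)
        if emp_type1(phi, h, neg) <= level:
            risk = emp_type2(phi, h, pos)
            if best is None or risk < best[1]:
                best = (lam, risk)
    return best

# Hypothetical base classifiers and samples on [0, 1].
h1 = lambda x: -1.0                          # always predicts the negative class
h2 = lambda x: 1.0 if x > 0.5 else -1.0      # decision stump
neg = [0.1, 0.2, 0.3, 0.6]
pos = [0.7, 0.8, 0.9, 0.4]

print(convexified_np(h1, h2, neg, pos, alpha=0.3, tau=0.2))
```

For larger $M$ a grid is impractical, and a general-purpose convex solver would be the natural choice since both the objective and the constraint are convex in $\lambda$.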

In the next section, we present a series of results on type I and type II errors of classifiers
that are more general than $\tilde h^\tau$.

4 Performance Bounds 

4.1 Control of type I error 

The first challenge is to identify classifiers $h$ such that $R^-_\varphi(h) \le \alpha$ with high probability. This is
done by enforcing that its empirical counterpart $\hat R^-_\varphi(h)$ be bounded from above by the quantity

\[ \alpha_\kappa = \alpha - \kappa/\sqrt{n^-}, \]

for a proper choice of positive constant $\kappa$.

Theorem 4.1. Fix constants $\delta, \alpha \in (0, 1)$, $L > 0$ and let $\varphi : [-1, 1] \to \mathbb{R}^+$ be a given $L$-Lipschitz
convex surrogate. Define

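Then for any (random) classifier $h \in \mathcal{H}^{\mathrm{conv}}$ that satisfies $\hat R^-_\varphi(h) \le \alpha_\kappa$, we have

\[ R^-(h) \le R^-_\varphi(h) \le \alpha \]

with probability at least $1 - \delta$. Equivalently

\[ \mathbb{P}\big[\mathcal{H}^{\varphi,\alpha_\kappa}_{n^-} \subset \mathcal{H}^{\varphi,\alpha}\big] \ge 1 - \delta. \tag{4.1} \]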

4.2 Simultaneous control of the two errors 

Theorem 4.1 guarantees that any classifier that satisfies the strengthened constraint on the
empirical $\varphi$-type I error will have $\varphi$-type I error and true type I error bounded from above
by $\alpha$. We now check that the constraint is not so strong that the type II error is overly
deteriorated. Indeed, an extremely small $\alpha_\kappa$ would certainly ensure a good control of type I error
but would deteriorate significantly the best achievable type II error. Below, we show not only
that this is not the case for our approach but also that the convexified NP classifier $\tilde h^\tau$ defined
in subsection 3.2 with $\tau = \kappa$ suffers only a small degradation of its type II error compared to
the best achievable. Analogous to classical binary classification, a desirable result is that with
high probability,

\[ R^+_\varphi(\tilde h^\kappa) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) \le \Delta_n(\mathcal{F}), \tag{4.2} \]

where $\Delta_n(\mathcal{F})$ goes to $0$ as $n = n^- + n^+ \to \infty$.

The following proposition is pivotal to our argument. 
Proposition 4.1. Fix constant $\alpha \in (0, 1)$ and let $\varphi : [-1, 1] \to \mathbb{R}^+$ be a given continuous
convex surrogate. Assume further that there exists $\nu_0 > 0$ such that the set of classifiers $\mathcal{H}^{\varphi,\alpha-\nu_0}$
is nonempty. Then, for any $\nu \in (0, \nu_0)$,

\[ \min_{h \in \mathcal{H}^{\varphi,\alpha-\nu}} R^+_\varphi(h) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) \le \varphi(1) \frac{\nu}{\nu_0 - \nu}. \]

This proposition ensures that if the convex surrogate $\varphi$ is continuous, strengthening the
constraint on type I error does not deteriorate too much the optimal type II error. We should
mention that the proof does not use the Lipschitz property of $\varphi$ but only that it is uniformly
bounded by $\varphi(1)$ on $[-1, 1]$. This proposition has direct consequences on chance constrained
programming as discussed in Section 5.

The next theorem shows that the NP classifier $\tilde h^\kappa$ defined in subsection 3.2 is a good candidate
to perform classification with the Neyman-Pearson paradigm. It relies on the following
assumption which is necessary to verify the condition of Proposition 4.1.

Assumption 1. There exists a positive constant $\varepsilon < 1$ such that the set of classifiers $\mathcal{H}^{\varphi,\varepsilon\alpha}$ is
nonempty.

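Note that this assumption can be tested using (4.1) for large enough $n^-$. Indeed, it follows
from this inequality that with probability $1 - \delta$,

\[ \mathcal{H}^{\varphi,\,\varepsilon\alpha - \kappa/\sqrt{n^-}}_{n^-} \subset \mathcal{H}^{\varphi,\,\varepsilon\alpha - \kappa/\sqrt{n^-} + \kappa/\sqrt{n^-}} = \mathcal{H}^{\varphi,\,\varepsilon\alpha}. \]

Thus, it is sufficient to check if $\mathcal{H}^{\varphi,\,\varepsilon\alpha - \kappa/\sqrt{n^-}}_{n^-}$ is nonempty for some $\varepsilon > 0$. Before stating our
main theorem, we need the following definition. Under Assumption 1, let $\bar\varepsilon$ denote the smallest
$\varepsilon$ such that $\mathcal{H}^{\varphi,\varepsilon\alpha} \ne \emptyset$ and let $n_0$ be the smallest integer such that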

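Theorem 4.2. Let $\varphi$, $\kappa$, $\delta$ and $\alpha$ be the same as in Theorem 4.1, and $\tilde h^\kappa$ denote any solution
to (3.4). Moreover, let Assumption 1 hold and assume that $n^- \ge n_0$ where $n_0$ is defined in (4.3).
Then, the following hold with probability $1 - 2\delta$,

\[ R^-(\tilde h^\kappa) \le R^-_\varphi(\tilde h^\kappa) \le \alpha \tag{4.4} \]

and

\[ R^+_\varphi(\tilde h^\kappa) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) \le \frac{4\varphi(1)\kappa}{(1 - \bar\varepsilon)\alpha\sqrt{n^-}} + \frac{2\kappa}{\sqrt{n^+}}. \tag{4.5} \]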

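In particular, as $M$, $n^-$ and $n^+$ all go to infinity and other quantities are held fixed, (4.5) yields

\[ R^+_\varphi(\tilde h^\kappa) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) = O\left(\sqrt{\frac{\log M}{n^-}} + \sqrt{\frac{\log M}{n^+}}\right). \]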

Note here that Theorem 4.2 is not exactly of the type (4.2). The right hand side of (4.5)
goes to zero if both $n^-$ and $n^+$ go to infinity. Moreover, inequality (4.5) conveys the message that
the accuracy of the estimate depends on information from both classes of labeled data. This concern
motivates us to consider a different sampling scheme.
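The dependence of the type II guarantee on both sample sizes can be explored numerically. The sketch below is a plain calculator, written under the assumption that the bound in (4.5) has the two-term form $4\varphi(1)\kappa / ((1 - \bar\varepsilon)\alpha\sqrt{n^-}) + 2\kappa/\sqrt{n^+}$ that also appears in the event of subsection 4.3, and that the bound of Proposition 4.1 reads $\varphi(1)\nu/(\nu_0 - \nu)$. The constant $\kappa$ is treated as a user-supplied input, and the function names and numerical values are hypothetical.

```python
import math

def prop41_gap(phi1, nu, nu0):
    """Bound phi(1) * nu / (nu0 - nu) on the loss in optimal phi-type II error
    caused by tightening the type I constraint by nu, for 0 < nu < nu0."""
    if not 0 < nu < nu0:
        raise ValueError("need 0 < nu < nu0")
    return phi1 * nu / (nu0 - nu)

def excess_type2_bound(phi1, kappa, eps_bar, alpha, n_minus, n_plus):
    """Two-term bound: the first term is driven by the negative sample size,
    the second by the positive sample size."""
    first = 4 * phi1 * kappa / ((1 - eps_bar) * alpha * math.sqrt(n_minus))
    second = 2 * kappa / math.sqrt(n_plus)
    return first + second

# Hypothetical values: hinge surrogate (phi(1) = 2) and kappa = 1.
for n in (10_000, 40_000, 160_000):
    print(n, excess_type2_bound(2.0, 1.0, 0.5, 0.1, n_minus=n, n_plus=n))

# When nu <= nu0 / 2, the Proposition 4.1 bound is at most 2 * phi(1) * nu / nu0.
print(prop41_gap(2.0, nu=0.02, nu0=0.05), 2 * 2.0 * 0.02 / 0.05)
```

Quadrupling both sample sizes halves the bound, and increasing only one of them leaves the other term unchanged, which is the point made in the paragraph above.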
4.3 A Different Sampling Scheme

We now consider a model for observations that is more standard in statistical learning theory 
(see, e.g., Boucheron et al., 2005; Devroye et al., 1996). 

Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be $n$ independent copies of the random couple $(X, Y) \in \mathcal{X} \times
\{-1, 1\}$. Denote by $P_X$ the marginal distribution of $X$ and by $\eta(x) = \mathbb{E}[Y \mid X = x]$ the regression
function of $Y$ onto $X$. Denote by $p$ the probability of positive label and observe that

\[ p = \mathbb{P}[Y = 1] = \mathbb{E}\big(\mathbb{P}[Y = 1 \mid X]\big) = \frac{1 + \mathbb{E}[\eta(X)]}{2}. \]

In what follows, we assume that $P_X(\eta(X) = -1) \vee P_X(\eta(X) = 1) < 1$ so that $p \in (0, 1)$.

Let $N^- = \mathrm{card}\{Y_i : Y_i = -1\}$ be the random number of instances labeled $-1$ and $N^+ =
n - N^- = \mathrm{card}\{Y_i : Y_i = 1\}$. In this setup, the NP classifier is defined as in subsection 3.2
where $n^-$ and $n^+$ are replaced by $N^-$ and $N^+$ respectively. To distinguish this classifier from $\tilde h^\kappa$
previously defined, we denote the NP classifier obtained with this sampling scheme by $\tilde h^\kappa_n$.

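Let the event $\mathcal{F}$ be defined by

\[ \mathcal{F} = \big\{R^-_\varphi(\tilde h^\kappa_n) \le \alpha\big\} \cap \Big\{R^+_\varphi(\tilde h^\kappa_n) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) \le \frac{4\varphi(1)\kappa}{(1 - \bar\varepsilon)\alpha\sqrt{N^-}} + \frac{2\kappa}{\sqrt{N^+}}\Big\}. \]

Denote $B_{n^-} = \{Y_1 = \cdots = Y_{n^-} = -1, Y_{n^-+1} = \cdots = Y_n = 1\}$. Although the event $B_{n^-}$ is
different from the event $\{N^- = n^-\}$, symmetry leads to the following key observation:

\[ \mathbb{P}(\mathcal{F} \mid N^- = n^-) = \mathbb{P}(\mathcal{F} \mid B_{n^-}). \]

Therefore, under the conditions of Theorem 4.2, we find that for $n^- \ge n_0$ the event $\mathcal{F}$ satisfies

\[ \mathbb{P}(\mathcal{F} \mid N^- = n^-) \ge 1 - 2\delta. \tag{4.6} \]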

We obtain the following corollary of Theorem 4.2. 

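Corollary 4.1. Let $\varphi$, $\kappa$, $\delta$ and $\alpha$ be the same as in Theorem 4.1, and $\tilde h^\kappa_n$ be the NP classifier
obtained with the current sampling scheme. Then under Assumption 1, if $n > 2n_0/(1 - p)$, where
$n_0$ is defined in (4.3), we have with probability $(1 - 2\delta)(1 - e^{-n(1-p)^2/2})$,

\[ R^-(\tilde h^\kappa_n) \le R^-_\varphi(\tilde h^\kappa_n) \le \alpha \tag{4.7} \]

and

\[ R^+_\varphi(\tilde h^\kappa_n) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) \le \frac{4\varphi(1)\kappa}{(1 - \bar\varepsilon)\alpha\sqrt{N^-}} + \frac{2\kappa}{\sqrt{N^+}}. \tag{4.8} \]

Moreover, with probability $1 - 2\delta - e^{-n(1-p)^2/2} - e^{-np^2/2}$, we have simultaneously (4.7) and

\[ R^+_\varphi(\tilde h^\kappa_n) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) \le \frac{4\sqrt{2}\,\varphi(1)\kappa}{(1 - \bar\varepsilon)\alpha\sqrt{n(1 - p)}} + \frac{2\sqrt{2}\,\kappa}{\sqrt{np}}. \tag{4.9} \]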
5 Chance constrained optimization



Implementing the Neyman-Pearson paradigm for the convexified binary classification bears
strong connections with chance constrained optimization. A recent account of such problems
can be found in Ben-Tal et al. (2009, Chapter 2) and we refer to this book for references and
applications. A chance constrained optimization problem is of the following form:

\[ \min_{\lambda \in \Lambda} f(\lambda) \quad \text{s.t.} \quad \mathbb{P}\{F(\lambda, \xi) \le 0\} \ge 1 - \alpha, \tag{5.1} \]

where $\xi \in \Xi$ is a random vector, $\Lambda \subset \mathbb{R}^M$ is convex, $\alpha$ is a small positive number and $f$ is a
deterministic real valued convex function. Problem (5.1) can be viewed as a relaxation of robust
optimization. Indeed, for the latter, the goal is to solve the problem

\[ \min_{\lambda \in \Lambda} f(\lambda) \quad \text{s.t.} \quad \sup_{\xi \in \Xi} F(\lambda, \xi) \le 0, \tag{5.2} \]

and this essentially corresponds to (5.1) for the case $\alpha = 0$. For simplicity, we take $F$ to
be scalar valued but extensions to vector valued functions and conic orders are considered in
the literature (see, e.g., Ben-Tal et al., 2009, Chapter 10). Moreover, it is standard to assume that $F(\cdot, \xi)$ is
convex almost surely.

Problem (5.1) may not be convex because the chance constraint $\{\lambda \in \Lambda : \mathbb{P}\{F(\lambda, \xi) \le 0\} \ge
1 - \alpha\}$ is not convex in general and thus may not be tractable. To solve this problem, Prekopa
(1995) and Lagoa et al. (2005) have derived sufficient conditions on the distribution of $\xi$ for
the chance constraint to be convex. On the other hand, Calafiore and Campi (2006) initiated
a different treatment of the problem where no assumption on the distribution of $\xi$ is made, in
line with the spirit of statistical learning. In that paper, they introduced the so-called scenario
approach based on a sample $\xi_1, \dots, \xi_n$ of independent copies of $\xi$. The scenario approach consists
of solving

\[ \min_{\lambda \in \Lambda} f(\lambda) \quad \text{s.t.} \quad F(\lambda, \xi_i) \le 0, \ i = 1, \dots, n. \tag{5.3} \]

Calafiore and Campi (2006) showed that under certain conditions, if the sample size $n$ is bigger
than some $n(\alpha, \delta)$, then with probability $1 - \delta$, the optimal solution $\hat\lambda^{sc}$ of (5.3) is feasible for
(5.1). The authors did not address the control of the term $f(\hat\lambda^{sc}) - f^*$ where $f^*$ denotes the
optimal objective value in (5.1). However, in view of Proposition 3.1, it is very unlikely that
this term can be controlled well.

In an attempt to overcome this limitation, a new analytical approach was introduced by Nemirovski and Shapiro
(2006). It amounts to solving the following convex optimization problem

\[ \min_{\lambda \in \Lambda,\ t \in \mathbb{R}^s} f(\lambda) \quad \text{s.t.} \quad G(\lambda, t) \le 0, \tag{5.4} \]

in which $t$ is some additional instrumental variable and where $G(\cdot, t)$ is convex. The problem
(5.4) provides a conservative convex approximation to (5.1), in the sense that every $\lambda$ feasible
for (5.4) is also feasible for (5.1). Nemirovski and Shapiro (2006) considered a particular class
of conservative convex approximation where the key step is to replace $\mathbb{P}\{F(\lambda, \xi) > 0\}$ by
$\mathbb{E}\varphi(F(\lambda, \xi))$ in (5.1), where $\varphi$ is a nonnegative, nondecreasing, convex function that takes value 1 at
0. Nemirovski and Shapiro (2006) discussed several choices of $\varphi$ including hinge and exponential
losses, with a focus on the latter that they name Bernstein Approximation.

The idea of a conservative convex approximation is also what we employ in our paper. Recall
that $P^-$ is the conditional distribution of $X$ given $Y = -1$. In a parallel form of (5.1), we cast
our target problem as

\[ \min_{\lambda \in \Lambda} R^+(h_\lambda) \quad \text{s.t.} \quad P^-\{h_\lambda(X) < 0\} \ge 1 - \alpha, \tag{5.5} \]

where $\Lambda$ is the flat simplex of $\mathbb{R}^M$.

Problem (5.5) differs from (5.1) in that $R^+(h_\lambda)$ is not a convex function of $\lambda$. Replacing
$R^+(h_\lambda)$ by $R^+_\varphi(h_\lambda)$ turns (5.5) into a standard chance constrained optimization problem:

\[ \min_{\lambda \in \Lambda} R^+_\varphi(h_\lambda) \quad \text{s.t.} \quad P^-\{h_\lambda(X) < 0\} \ge 1 - \alpha. \tag{5.6} \]

However, there are two important differences in our setting, so that we cannot use directly
the Scenario Approach, the Bernstein Approximation or other analytical approaches to (5.1). First,
$R^+_\varphi(h_\lambda)$ is an unknown function of $\lambda$. Second, we assume minimum knowledge about $P^-$. On the
other hand, chance constrained optimization techniques in previous literature assume knowledge
about the distribution of the random vector $\xi$. For example, Nemirovski and Shapiro (2006)
require that the moment generating function of the random vector $\xi$ is efficiently computable to
study the Bernstein Approximation.

Given a finite sample, it is not feasible to construct a strictly conservative approximation to
the constraint in (5.6). Instead, what is possible is to ensure that if we learn $h$ from the sample,
this constraint is satisfied with high probability $1 - \delta$, i.e., the classifier is approximately feasible
for (5.6). In retrospect, our approach to (5.6) is an innovative hybrid between the analytical
approach based on convex surrogates and the scenario approach.

We do have structural assumptions on the problem. Let $g_j$, $j \in \{1, \dots, M\}$ be arbitrary
functions that take values in $[-1, 1]$ and $F(\lambda, \xi) = \sum_{j=1}^M \lambda_j g_j(\xi)$. Consider a convexified version
of (5.1):

\[ \min_{\lambda \in \Lambda} f(\lambda) \quad \text{s.t.} \quad \mathbb{E}[\varphi(F(\lambda, \xi))] \le \alpha, \tag{5.7} \]

where $\varphi$ is an $L$-Lipschitz convex surrogate, $L > 0$. Suppose that we observe a sample $(\xi_1, \dots, \xi_n)$
of independent copies of $\xi$. We propose to approximately solve the above problem by

\[ \min_{\lambda \in \Lambda} f(\lambda) \quad \text{s.t.} \quad \sum_{i=1}^n \varphi(F(\lambda, \xi_i)) \le n\alpha - \kappa\sqrt{n}, \]

for some $\kappa > 0$ to be defined. Denote by $\tilde\lambda$ any solution to this problem and by $f^*_\varphi$ the value of
the objective at the optimum in (5.7). The following theorem summarizes our contribution to
chance constrained optimization.

Theorem 5.1. Fix constants $\delta, \alpha \in (0, 1/2)$, $L > 0$ and let $\varphi : [-1, 1] \to \mathbb{R}^+$ be a given
$L$-Lipschitz convex surrogate. Define

Then, the following hold with probability at least $1 - 2\delta$
(i) $\tilde\lambda$ is feasible for (5.1).

(ii) If there exists $\varepsilon \in (0, 1)$ such that the constraint $\mathbb{E}[\varphi(F(\lambda, \xi))] \le \varepsilon\alpha$ is feasible for some
$\lambda \in \Lambda$, then for

we have

\[ f(\tilde\lambda) - f^*_\varphi \le \frac{4\varphi(1)\kappa}{(1 - \varepsilon)\alpha\sqrt{n}}. \]

In particular, as $M$ and $n$ go to infinity with all other quantities kept fixed, we obtain

The proof essentially follows that of Theorem 4.2 and we omit it. The limitations of Theorem 5.1
include rigid structural assumptions on the function $F$ and on the set $\Lambda$. While the
latter can be easily relaxed using more sophisticated empirical process theory, the former is
inherent to our analysis. Also, we did not address the effect of replacing the indicator function
by a convex surrogate; this investigation is beyond the scope of this paper.

6 Appendix

6.1 Proof of Proposition 3.1

Let the base classifiers be defined as

\[ h_1(x) = -1 \quad \text{and} \quad h_2(x) = \mathbf{1}(x \le \alpha) - \mathbf{1}(x > \alpha), \quad \forall x \in [0, 1]. \]

For any $\lambda \in [0, 1]$, denote the convex combination of $h_1$ and $h_2$ by $h_\lambda = \lambda h_1 + (1 - \lambda) h_2$, i.e.,

\[ h_\lambda(x) = (1 - 2\lambda)\mathbf{1}(x \le \alpha) - \mathbf{1}(x > \alpha). \]

Suppose the conditional distributions of $X$ given $Y = 1$ or $Y = -1$, denoted respectively by
$P^+$ and $P^-$, are both uniform on $[0, 1]$. Recall that $R^-(h_\lambda) = P^-(h_\lambda(X) \ge 0)$ and $R^+(h_\lambda) =
P^+(h_\lambda(X) \le 0)$. Then, we have

\[ R^-(h_\lambda) = P^-(h_\lambda(X) \ge 0) = \alpha\,\mathbf{1}(\lambda \le 1/2). \tag{6.1} \]

Therefore, for any $\tau \in [0, \alpha]$, we have

\[ \{\lambda \in [0, 1] : R^-(h_\lambda) \le \tau\} = \begin{cases} [0, 1] & \text{if } \tau = \alpha, \\ (1/2, 1] & \text{if } \tau < \alpha. \end{cases} \]

Observe now that

\[ R^+(h_\lambda) = P^+(h_\lambda(X) \le 0) = (1 - \alpha)\mathbf{1}(\lambda < 1/2) + \mathbf{1}(\lambda \ge 1/2). \tag{6.2} \]
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For any r G [0, a], it yields 

■ f D+ru ^ / 1 - a if T = a , 

Ae[0,l]:_R-(hA)<T 1^ 1 ltT<a. 

Consider now a classifier such that R~{hx) < r for some t < a. Then from (6.1), we see that 
must have A > 1/2. Together with (6.2), this imples that i?+(hA) = 1- It yields 

R+(hx)- min R+ (hx) = 1 - (1 - a) = a . 

X:R-{hx)<a 

This completes the first part of the proposition. Moreover, in the same manner as (6.1), it can 
be easily proved that 

1 

^"(^a) = —yZ I(hA(^r) > 0) = a„- 1I(A < 1/2) , (6.3) 

where 



1=1 



1 " 

- 1{X[ < a) (6.4) 



n . 

1=1 



If a classifier hx is such that R (Ha) < an-, then (6.3) implies that A > 1/2. Using again (6.2), 
we find also that i?+(hA) = 1. It yields 

R+(hx)- min R+ (hx) = 1 - (1 - a) = a . 

X:R-(hx)<a 

It remains to show that R^{hx) < a„- with positive probability for any classifier such that 
R i^x) ^ T for some t < a. Note that a sufficient condition for a classifier to satisfy this 
constraint is to have a < a„- . It is therefore sufficient to find a lower bound on the probability 
of the event A = > «}• Such a lower bound is provided by Lemma 6.4, which guarantees 

that P(^) > a Al/4. 
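
The counterexample can be checked numerically. The sketch below is only an illustration: it evaluates $h_\lambda$ on each of the two pieces $[0,\alpha]$ and $(\alpha,1]$, where it is constant, and weights the pieces by their uniform mass. The function names are hypothetical.

```python
def h_lambda(lam, x, alpha):
    """Convex combination lam*h1 + (1-lam)*h2 with h1 = -1 and
    h2(x) = 1 if x <= alpha else -1."""
    h1 = -1.0
    h2 = 1.0 if x <= alpha else -1.0
    return lam * h1 + (1.0 - lam) * h2

def risks(lam, alpha):
    """Return (R^-, R^+) when X is uniform on [0, 1] under both classes.

    h_lambda is constant on [0, alpha] and on (alpha, 1], so one
    representative point per piece is enough.
    """
    left = h_lambda(lam, alpha / 2.0, alpha)           # value on [0, alpha]
    right = h_lambda(lam, (1.0 + alpha) / 2.0, alpha)  # value on (alpha, 1]
    r_minus = alpha * (left >= 0) + (1.0 - alpha) * (right >= 0)
    r_plus = alpha * (left < 0) + (1.0 - alpha) * (right < 0)
    return r_minus, r_plus

alpha = 0.1
at_level = risks(0.25, alpha)   # lambda <= 1/2: type I error equals alpha
below = risks(0.75, alpha)      # lambda > 1/2: type I error drops to 0
excess = below[1] - at_level[1]
```

Any $\lambda > 1/2$ brings the type I error strictly below $\alpha$ but pushes the type II error to 1, so the excess type II error equals $\alpha$, as in the proof.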



6.2 Proof of Theorem 4.1 

We begin with the following lemma, which is extensively used in the sequel. Its proof relies 
on standard arguments to bound suprema of empirical processes. Recall that $\{h_1,\dots,h_M\}$ is 
a family of $M$ classifiers such that $h_j : \mathcal{X} \to [-1,1]$ and that for any $\lambda$ in the simplex $\Lambda \subset \mathbb{R}^M$, 
$h_\lambda$ denotes the convex combination defined by 

$$h_\lambda = \sum_{j=1}^M \lambda_j h_j .$$

The following standard notation in empirical process theory will be used. Let $X_1,\dots,X_n \in \mathcal{X}$ 
be $n$ i.i.d. random variables with marginal distribution $P$. Then for any measurable function 
$f : \mathcal{X} \to \mathbb{R}$, we write 

$$P_n(f) = \frac{1}{n}\sum_{i=1}^n f(X_i) \quad \text{and} \quad P(f) = \mathbb{E} f(X) = \int f \, dP .$$

Moreover, the Rademacher average of $f$ is defined as 

$$R_n(f) = \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) ,$$

where $\varepsilon_1,\dots,\varepsilon_n$ are i.i.d. Rademacher random variables such that $\mathbb{P}(\varepsilon_i = 1) = \mathbb{P}(\varepsilon_i = -1) = 1/2$ 
for $i = 1,\dots,n$. 
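
To make the notation concrete, the following Python sketch computes $P_n(f)$ and $R_n(f)$ on simulated data and illustrates that $R_n$ is linear in $f$, so that $|R_n(h_\lambda)|$ never exceeds $\max_j |R_n(h_j)|$ for $\lambda$ in the simplex. The base classifiers, the sample and the weights are hypothetical choices made only for illustration.

```python
import math
import random

def empirical_mean(f, xs):
    """P_n(f) = (1/n) * sum_i f(X_i)."""
    return sum(f(x) for x in xs) / len(xs)

def rademacher_average(f, xs, eps):
    """R_n(f) = (1/n) * sum_i eps_i * f(X_i) for fixed signs eps_i in {-1, +1}."""
    return sum(e * f(x) for e, x in zip(eps, xs)) / len(xs)

def convex_combination(lam, hs):
    """h_lambda = sum_j lam_j * h_j for weights lam on the simplex."""
    return lambda x: sum(l * h(x) for l, h in zip(lam, hs))

# Hypothetical base classifiers with values in [-1, 1] (illustration only).
base = [math.sin, math.cos, lambda x: math.tanh(x - 0.5)]

rng = random.Random(0)
xs = [rng.random() for _ in range(200)]       # X_i ~ Uniform[0, 1]
eps = [rng.choice((-1, 1)) for _ in xs]       # Rademacher signs

lam = [0.2, 0.3, 0.5]
r_vertices = [rademacher_average(h, xs, eps) for h in base]
r_lambda = rademacher_average(convex_combination(lam, base), xs, eps)
```

Because $R_n(h_\lambda) = \sum_j \lambda_j R_n(h_j)$, the supremum of $|R_n(h_\lambda)|$ over the simplex is attained at one of the base classifiers.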

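Lemma 6.1. Fix $L > 0$, $\delta \in (0,1)$. Let $X_1,\dots,X_n$ be $n$ i.i.d. random variables on $\mathcal{X}$ with 
marginal distribution $P$. Moreover, let $\varphi : [-1,1] \to \mathbb{R}$ be an $L$-Lipschitz function. Then, with 
probability at least $1-\delta$, it holds 

$$\sup_{\lambda \in \Lambda} |(P_n - P)(\varphi \circ h_\lambda)| \le \frac{4\sqrt{2}\,L}{\sqrt{n}}\sqrt{\log\left(\frac{2M}{\delta}\right)} .$$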



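Proof. Define $\bar\varphi(\cdot) = \varphi(\cdot) - \varphi(0)$, so that $\bar\varphi$ is an $L$-Lipschitz function that satisfies $\bar\varphi(0) = 0$. 
Moreover, for any $\lambda \in \Lambda$, it holds 

$$(P_n - P)(\varphi \circ h_\lambda) = (P_n - P)(\bar\varphi \circ h_\lambda) .$$

Let $\Phi : \mathbb{R} \to \mathbb{R}_+$ be a given convex increasing function. Applying successively the symmetrization 
and the contraction inequalities (see, e.g., Koltchinskii, 2008, Section 2), we find 

$$\mathbb{E}\Phi\Big(\sup_{\lambda \in \Lambda} |(P_n - P)(\bar\varphi \circ h_\lambda)|\Big) \le \mathbb{E}\Phi\Big(2\sup_{\lambda \in \Lambda} |R_n(\bar\varphi \circ h_\lambda)|\Big) \le \mathbb{E}\Phi\Big(4L\sup_{\lambda \in \Lambda} |R_n(h_\lambda)|\Big) .$$

Observe now that $\lambda \mapsto |R_n(h_\lambda)|$ is a convex function and Theorem 32.2 in Rockafellar (1997) 
entails that 

$$\sup_{\lambda \in \Lambda} |R_n(h_\lambda)| = \max_{1 \le j \le M} |R_n(h_j)| .$$

We now use a Chernoff bound to control this quantity. To that end, fix $s, t > 0$, and observe 
that 

$$\mathbb{P}\Big(\sup_{\lambda \in \Lambda} |(P_n - P)(\varphi \circ h_\lambda)| > t\Big) \le \frac{1}{\Phi(st)}\,\mathbb{E}\Phi\Big(s\sup_{\lambda \in \Lambda} |(P_n - P)(\varphi \circ h_\lambda)|\Big) \le \frac{1}{\Phi(st)}\,\mathbb{E}\Phi\Big(4Ls\max_{1 \le j \le M} |R_n(h_j)|\Big) . \tag{6.5}$$

Moreover, since $\Phi$ is increasing, 

$$\begin{aligned}
\mathbb{E}\Phi\Big(4Ls\max_{1 \le j \le M} |R_n(h_j)|\Big) &= \mathbb{E}\max_{1 \le j \le M} \Phi\big(4Ls|R_n(h_j)|\big) \\
&\le \sum_{j=1}^M \mathbb{E}\big[\Phi(4LsR_n(h_j)) \vee \Phi(-4LsR_n(h_j))\big] \\
&\le 2\sum_{j=1}^M \mathbb{E}\Phi(4LsR_n(h_j)) .
\end{aligned} \tag{6.6}$$

Now choose $\Phi(\cdot) = \exp(\cdot)$, then 

$$\mathbb{E}\Phi(4LsR_n(h_j)) = \prod_{i=1}^n \mathbb{E}\cosh\left(\frac{4Ls\,h_j(X_i)}{n}\right) \le \exp\left(\frac{8L^2s^2}{n}\right) ,$$

where cosh is the hyperbolic cosine function and where in the inequality, we used the fact that 
$|h_j(X_i)| \le 1$ for any $i, j$ and $\cosh(x) \le \exp(x^2/2)$. Together with (6.5) and (6.6), it yields 

$$\mathbb{P}\Big(\sup_{\lambda \in \Lambda} |(P_n - P)(\varphi \circ h_\lambda)| > t\Big) \le 2M\inf_{s > 0}\exp\left(\frac{8L^2s^2}{n} - st\right) \le 2M\exp\left(-\frac{nt^2}{32L^2}\right) .$$

Choosing 

$$t = \frac{4\sqrt{2}\,L}{\sqrt{n}}\sqrt{\log\left(\frac{2M}{\delta}\right)}$$

completes the proof of the Lemma. □ 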
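
The last two steps of the proof are pure calculus and can be verified numerically. The sketch below, with hypothetical function names, evaluates the Chernoff bound at the minimizer $s = nt/(16L^2)$ of $s \mapsto 8L^2s^2/n - st$ and checks that plugging in the chosen $t$ returns exactly $\delta$.

```python
import math

def deviation_bound(n, M, L, delta):
    """Right-hand side of Lemma 6.1: 4*sqrt(2)*L*sqrt(log(2M/delta)/n)."""
    return 4.0 * math.sqrt(2.0) * L * math.sqrt(math.log(2.0 * M / delta) / n)

def chernoff_exponent(s, t, n, L):
    """Exponent 8 L^2 s^2 / n - s t appearing in the proof."""
    return 8.0 * L ** 2 * s ** 2 / n - s * t

def chernoff_tail(t, n, M, L):
    """2M * inf_{s>0} exp(8 L^2 s^2 / n - s t), using the minimizer s = n t / (16 L^2)."""
    s_star = n * t / (16.0 * L ** 2)
    return 2.0 * M * math.exp(chernoff_exponent(s_star, t, n, L))

n, M, L, delta = 500, 10, 1.0, 0.05
t = deviation_bound(n, M, L, delta)
tail = chernoff_tail(t, n, M, L)   # equals delta up to rounding
```

The exponent at the minimizer is $-nt^2/(32L^2)$, which is where the constant $4\sqrt{2} = \sqrt{32}$ in the lemma comes from.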

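We now proceed to the proof of Theorem 4.1. Note first that from the properties of $\varphi$, 
$R^-(h) \le R^-_\varphi(h)$. Next, we have for any data-dependent classifier $h \in \mathcal{H}^{\mathrm{conv}}$ such that 
$\hat R^-_\varphi(h) \le \alpha - \kappa/\sqrt{n^-}$: 

$$R^-_\varphi(h) \le \hat R^-_\varphi(h) + \sup_{h \in \mathcal{H}^{\mathrm{conv}}} \big|\hat R^-_\varphi(h) - R^-_\varphi(h)\big| \le \alpha - \frac{\kappa}{\sqrt{n^-}} + \sup_{h \in \mathcal{H}^{\mathrm{conv}}} \big|\hat R^-_\varphi(h) - R^-_\varphi(h)\big| .$$

Lemma 6.1 implies that, with probability $1-\delta$, 

$$\sup_{h \in \mathcal{H}^{\mathrm{conv}}} \big|\hat R^-_\varphi(h) - R^-_\varphi(h)\big| = \sup_{\lambda \in \Lambda} \big|(P^-_{n^-} - P^-)(\varphi \circ h_\lambda)\big| \le \frac{\kappa}{\sqrt{n^-}} .$$

The previous two displays imply that $R^-_\varphi(h) \le \alpha$ with probability $1-\delta$, which completes the 
proof of Theorem 4.1. 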

6.3 Proof of Proposition 4.1 

The proof of this proposition builds upon the following lemma. 

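Lemma 6.2. Let $\gamma(\alpha) = \inf_{h_\lambda \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h_\lambda)$, then $\gamma$ is a non-increasing convex function on $[0,1]$. 

Proof. First, it is clear that $\gamma$ is a non-increasing function of $\alpha$ because for $\alpha' > \alpha$, 

$$\{h_\lambda \in \mathcal{H}^{\mathrm{conv}} : R^-_\varphi(h_\lambda) \le \alpha\} \subset \{h_\lambda \in \mathcal{H}^{\mathrm{conv}} : R^-_\varphi(h_\lambda) \le \alpha'\} .$$

We now show that $\gamma$ is convex. To that end, observe first that since $\varphi$ is continuous on 
$[-1,1]$, the set $\{\lambda \in \Lambda : h_\lambda \in \mathcal{H}^{\varphi,\alpha}\}$ is compact. Moreover, the function $\lambda \mapsto R^+_\varphi(h_\lambda)$ is convex. 
Therefore, there exists $\lambda^* \in \Lambda$ such that 

$$\gamma(\alpha) = \inf_{h_\lambda \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h_\lambda) = \min_{h_\lambda \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h_\lambda) = R^+_\varphi(h_{\lambda^*}) .$$

Now, fix $\alpha_1, \alpha_2 \in [0,1]$. From the above considerations, there exist $\lambda_1, \lambda_2 \in \Lambda$ such that 
$\gamma(\alpha_1) = R^+_\varphi(h_{\lambda_1})$ and $\gamma(\alpha_2) = R^+_\varphi(h_{\lambda_2})$. For any $\theta \in (0,1)$, define the convex combinations 
$\bar\alpha_\theta = \theta\alpha_1 + (1-\theta)\alpha_2$ and $\bar\lambda_\theta = \theta\lambda_1 + (1-\theta)\lambda_2$. Since $\lambda \mapsto R^-_\varphi(h_\lambda)$ is convex, it holds 

$$R^-_\varphi(h_{\bar\lambda_\theta}) \le \theta R^-_\varphi(h_{\lambda_1}) + (1-\theta) R^-_\varphi(h_{\lambda_2}) \le \theta\alpha_1 + (1-\theta)\alpha_2 = \bar\alpha_\theta ,$$

so that $h_{\bar\lambda_\theta} \in \mathcal{H}^{\varphi,\bar\alpha_\theta}$. Hence, $\gamma(\bar\alpha_\theta) \le R^+_\varphi(h_{\bar\lambda_\theta})$. Together with the convexity of $\varphi$, it yields 

$$\gamma(\theta\alpha_1 + (1-\theta)\alpha_2) \le R^+_\varphi(h_{\bar\lambda_\theta}) \le \theta R^+_\varphi(h_{\lambda_1}) + (1-\theta) R^+_\varphi(h_{\lambda_2}) = \theta\gamma(\alpha_1) + (1-\theta)\gamma(\alpha_2) .$$

□ 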

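We now complete the proof of Proposition 4.1. For any $x \in [0,1]$, let $\gamma(x) = \inf_{h \in \mathcal{H}^{\varphi,x}} R^+_\varphi(h)$ 
and observe that the statement of the proposition is equivalent to 

$$\gamma(\alpha - \nu) - \gamma(\alpha) \le \varphi(1)\,\frac{\nu}{\nu_0 - \nu} , \qquad 0 < \nu < \nu_0 . \tag{6.7}$$

Lemma 6.2 together with the assumption that $\mathcal{H}^{\varphi,\alpha-\nu_0} \neq \emptyset$ imply that $\gamma$ is a non-increasing 
convex real-valued function on $[\alpha - \nu_0, 1]$ so that 

$$\gamma(\alpha - \nu) - \gamma(\alpha) \le \nu \sup_{g \in \partial\gamma(\alpha-\nu)} |g| ,$$

where $\partial\gamma(\alpha - \nu)$ denotes the sub-differential of $\gamma$ at $\alpha - \nu$. Moreover, since $\gamma$ is a non-increasing 
convex function on $[\alpha - \nu_0, \alpha - \nu]$, it holds 

$$\gamma(\alpha - \nu_0) - \gamma(\alpha - \nu) \ge (\nu_0 - \nu) \sup_{g \in \partial\gamma(\alpha-\nu)} |g| .$$

The previous two displays yield 

$$\gamma(\alpha - \nu) - \gamma(\alpha) \le \nu\,\frac{\gamma(\alpha - \nu_0) - \gamma(\alpha - \nu)}{\nu_0 - \nu} \le \nu\,\frac{\varphi(1)}{\nu_0 - \nu} .$$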



6.4 Proof of Theorem 4.2 

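Define the events $\mathcal{E}^-$ and $\mathcal{E}^+$ by 

$$\mathcal{E}^- = \bigcap_{h \in \mathcal{H}^{\mathrm{conv}}} \Big\{\big|\hat R^-_\varphi(h) - R^-_\varphi(h)\big| \le \frac{\kappa}{\sqrt{n^-}}\Big\} , \qquad \mathcal{E}^+ = \bigcap_{h \in \mathcal{H}^{\mathrm{conv}}} \Big\{\big|\hat R^+_\varphi(h) - R^+_\varphi(h)\big| \le \frac{\kappa}{\sqrt{n^+}}\Big\} .$$

Lemma 6.1 implies 

$$\mathbb{P}(\mathcal{E}^-) \wedge \mathbb{P}(\mathcal{E}^+) \ge 1 - \delta . \tag{6.8}$$

Note first that Theorem 4.1 implies that (4.4) holds with probability $1-\delta$. Observe now that 
the l.h.s. of (4.5) can be decomposed as 

$$R^+_\varphi(\tilde h^\kappa) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) = A_1 + A_2 + A_3 ,$$

where 

$$\begin{aligned}
A_1 &= \Big(R^+_\varphi(\tilde h^\kappa) - \hat R^+_\varphi(\tilde h^\kappa)\Big) + \Big(\hat R^+_\varphi(\tilde h^\kappa) - \min_{h \in \mathcal{H}^{\varphi,\alpha_\kappa}_{n^-}} R^+_\varphi(h)\Big) , \\
A_2 &= \min_{h \in \mathcal{H}^{\varphi,\alpha_\kappa}_{n^-}} R^+_\varphi(h) - \min_{h \in \mathcal{H}^{\varphi,\alpha_{2\kappa}}} R^+_\varphi(h) , \\
A_3 &= \min_{h \in \mathcal{H}^{\varphi,\alpha_{2\kappa}}} R^+_\varphi(h) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) .
\end{aligned}$$

To bound $A_1$ from above, observe that 

$$A_1 \le 2\sup_{h \in \mathcal{H}^{\varphi,\alpha_\kappa}_{n^-}} \big|\hat R^+_\varphi(h) - R^+_\varphi(h)\big| \le 2\sup_{h \in \mathcal{H}^{\mathrm{conv}}} \big|\hat R^+_\varphi(h) - R^+_\varphi(h)\big| .$$

Therefore, on the event $\mathcal{E}^+$ it holds 

$$A_1 \le \frac{2\kappa}{\sqrt{n^+}} .$$

We now treat $A_2$. Note that $A_2 \le 0$ on the event $\mathcal{H}^{\varphi,\alpha_{2\kappa}} \subset \mathcal{H}^{\varphi,\alpha_\kappa}_{n^-}$. But this event contains 
$\mathcal{E}^-$ so that $A_2 \le 0$ on the event $\mathcal{E}^-$. 

Finally, to control $A_3$, observe that under Assumption 1, Proposition 4.1 can be applied with 
$\nu = 2\kappa/\sqrt{n^-}$ and $\nu_0 = (1-\varepsilon)\alpha$. Indeed, the assumptions of the theorem imply that $\nu \le \nu_0/2$. 
It yields 

$$A_3 \le \frac{4\varphi(1)\kappa}{(1-\varepsilon)\alpha\sqrt{n^-}} .$$

Combining the bounds on $A_1$, $A_2$ and $A_3$ obtained above, we find that (4.5) holds on the event 
$\mathcal{E}^- \cap \mathcal{E}^+$ that has probability at least $1-2\delta$ in view of (6.8). 

The last statement of the theorem follows directly from the definition of $\kappa$. 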
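
For completeness, the arithmetic behind the bound on $A_3$ can be spelled out as follows. This is a short worked derivation from (6.7), assuming the substitutions $\nu = 2\kappa/\sqrt{n^-}$ and $\nu_0 = (1-\varepsilon)\alpha$ stated in the proof and using only that $\nu \le \nu_0/2$ implies $\nu_0 - \nu \ge \nu_0/2$.

```latex
\frac{\varphi(1)\,\nu}{\nu_0 - \nu}
\;\le\; \frac{\varphi(1)\,\nu}{\nu_0/2}
\;=\; \frac{2\varphi(1)}{(1-\varepsilon)\alpha}\cdot\frac{2\kappa}{\sqrt{n^-}}
\;=\; \frac{4\varphi(1)\,\kappa}{(1-\varepsilon)\,\alpha\,\sqrt{n^-}} .
```

The condition $\nu \le \nu_0/2$ is equivalent to $n^- \ge \big(4\kappa/((1-\varepsilon)\alpha)\big)^2$, which is the sample-size requirement under which the bound is claimed.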

6.5 Proof of Corollary 4.1 

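We now prove (4.8): 

$$\begin{aligned}
\mathbb{P}(\mathcal{F}) &= \sum_{n^-=0}^{n} \mathbb{P}(\mathcal{F} \mid N^- = n^-)\,\mathbb{P}(N^- = n^-) \\
&\ge \sum_{n^-=n_0}^{n} \mathbb{P}(\mathcal{F} \mid N^- = n^-)\,\mathbb{P}(N^- = n^-) \\
&\ge (1-2\delta)\,\mathbb{P}(N^- \ge n_0) ,
\end{aligned}$$

where in the last inequality, we used (4.6). Applying now Lemma 6.3, we obtain 

$$\mathbb{P}(N^- \ge n_0) \ge 1 - e^{-\frac{n(1-p)^2}{2}} .$$

Therefore, 

$$\mathbb{P}(\mathcal{F}) \ge (1-2\delta)\Big(1 - e^{-\frac{n(1-p)^2}{2}}\Big) ,$$

which completes the proof of (4.8). 

The proof of (4.9) follows by observing that 

$$\Big\{R^+_\varphi(\tilde h^\kappa) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) > \frac{4\sqrt{2}\,\varphi(1)\kappa}{(1-\varepsilon)\alpha\sqrt{n(1-p)}} + \frac{2\sqrt{2}\,\kappa}{\sqrt{np}}\Big\} \subset A_1 \cup A_2 \cup A_3 = (A_1 \cap A_2^c) \cup A_2 \cup A_3 ,$$

where 

$$\begin{aligned}
A_1 &= \Big\{R^+_\varphi(\tilde h^\kappa) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) > \frac{4\varphi(1)\kappa}{(1-\varepsilon)\alpha\sqrt{N^-}} + \frac{2\kappa}{\sqrt{N^+}}\Big\} \subset \mathcal{F}^c , \\
A_2 &= \{N^- < n(1-p)/2\} , \\
A_3 &= \{N^+ < np/2\} .
\end{aligned}$$

Since $A_2^c \subset \{N^- \ge n_0\}$, we find 

$$\mathbb{P}(A_1 \cap A_2^c) \le \sum_{n^- \ge n_0} \mathbb{P}(\mathcal{F}^c \mid N^- = n^-)\,\mathbb{P}(N^- = n^-) \le 2\delta .$$

Next, using Lemma 6.3, we get 

$$\mathbb{P}(A_2) \le e^{-\frac{n(1-p)^2}{2}} \quad \text{and} \quad \mathbb{P}(A_3) \le e^{-\frac{np^2}{2}} .$$

Hence, we find 

$$\mathbb{P}\Big\{R^+_\varphi(\tilde h^\kappa) - \min_{h \in \mathcal{H}^{\varphi,\alpha}} R^+_\varphi(h) > \frac{4\sqrt{2}\,\varphi(1)\kappa}{(1-\varepsilon)\alpha\sqrt{n(1-p)}} + \frac{2\sqrt{2}\,\kappa}{\sqrt{np}}\Big\} \le 2\delta + e^{-\frac{n(1-p)^2}{2}} + e^{-\frac{np^2}{2}} ,$$

which completes the proof of the corollary. 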



6.6 Technical lemmas on Binomial distributions 

The following lemmas are purely technical and arise from the fact that we observe binary data. 
They are used in two unrelated results. 

Lemma 6.3. Let $N$ be a binomial random variable with parameters $n \ge 1$ and $q \in (0,1)$. 
Then, for any $t > 0$ such that $t \le nq/2$, it holds 

$$\mathbb{P}(N \ge t) \ge 1 - e^{-\frac{nq^2}{2}} .$$

Proof. Note first that $n - N$ has binomial distribution with parameters $n \ge 1$ and $1 - q$. 
Therefore, we can write $n - N = \sum_{i=1}^n Z_i$ where the $Z_i$ are i.i.d. Bernoulli random variables with 
parameter $1 - q$. Thus, using Hoeffding's inequality, we find that for any $s \ge 0$, 

$$\mathbb{P}(n - N - n(1-q) \ge s) \le e^{-\frac{2s^2}{n}} .$$

Applying the above inequality with $s = n - n(1-q) - t \ge nq/2 \ge 0$ yields 

$$\mathbb{P}(N \ge t) = \mathbb{P}(n - N - n(1-q) \le n - n(1-q) - t) \ge 1 - e^{-\frac{nq^2}{2}} .$$

□ 
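
Lemma 6.3 can be sanity-checked against exact binomial tails. The sketch below is an illustration with hypothetical function names; it compares $\mathbb{P}(N \ge t)$ at the largest admissible threshold $t = nq/2$ with the lower bound $1 - e^{-nq^2/2}$ on a small grid of parameters.

```python
import math

def binom_tail(n, q, t):
    """Exact P(N >= t) for N ~ Binomial(n, q)."""
    k0 = max(0, math.ceil(t))  # smallest integer k with k >= t
    return sum(math.comb(n, k) * q ** k * (1 - q) ** (n - k) for k in range(k0, n + 1))

def lemma_6_3_bound(n, q):
    """Lower bound 1 - exp(-n q^2 / 2) from Lemma 6.3."""
    return 1.0 - math.exp(-n * q * q / 2.0)

# Compare the exact tail with the bound at t = n*q/2.
checks = [
    (n, q, binom_tail(n, q, n * q / 2.0), lemma_6_3_bound(n, q))
    for n in (1, 5, 20, 30)
    for q in (0.1, 0.3, 0.5, 0.9)
]
```

On this grid the exact tail is well above the bound, which is expected since Hoeffding's inequality does not exploit the variance of the binomial distribution.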



The next lemma provides a lower bound on the probability that a binomial distribution 
exceeds its expectation. Our result is uniform in the size of the binomial and it can be easily 
verified that it is sharp by considering sizes $n = 1$ and $n = 2$. In particular, we do not resort to 
a Gaussian approximation, which improves upon the lower bounds that can be derived from the 
inequalities presented in Slud (1977). 

Lemma 6.4. Let $N$ be a binomial random variable with parameters $n \ge 1$ and $0 < q \le 1/2$. 
Then, it holds 

$$\mathbb{P}(N \ge nq) \ge q \wedge (1/4) .$$
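
The statement of Lemma 6.4 can be checked exactly on a grid using rational arithmetic. The sketch below is only an illustration with hypothetical function names; it uses `fractions.Fraction` so that the threshold $nq$ and the comparison with $q \wedge 1/4$ are free of rounding error.

```python
from fractions import Fraction
from math import ceil, comb

def prob_at_least_mean(n, q):
    """Exact P(N >= n*q) for N ~ Binomial(n, q), with q given as a Fraction."""
    k0 = ceil(n * q)  # smallest integer k with k >= n*q
    return sum(comb(n, k) * q ** k * (1 - q) ** (n - k) for k in range(k0, n + 1))

def lemma_6_4_holds(n, q):
    """Check P(N >= n*q) >= min(q, 1/4) exactly."""
    return prob_at_least_mean(n, q) >= min(q, Fraction(1, 4))

# Grid of sizes n = 1..30 and probabilities q = 1/40, 2/40, ..., 20/40.
grid_ok = all(
    lemma_6_4_holds(n, Fraction(j, 40))
    for n in range(1, 31)
    for j in range(1, 21)
)
```

For $n = 1$ the bound holds with equality, since $\mathbb{P}(N \ge q) = q$. For $n = 2$ the value $1/4$ appears as $P_{1/2}(N \ge 2)$, the quantity that the proof uses as its lower bound.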

Proof. We introduce the following local definition, which is limited to the scope of this 
proof. Fix $n \ge 1$ and for any $q \in (0,1)$, let $P_q$ denote the distribution of a binomial random 
variable with parameters $n$ and $q$. Note first that if $n = 1$, the result is trivial since 

$$P_q(N \ge q) = \mathbb{P}(Z \ge q) = \mathbb{P}(Z = 1) = q ,$$

where $Z$ is a Bernoulli random variable with parameter $q$. 

Assume that $n \ge 2$. Note that if $q \le 1/n$, then $P_q(N \ge nq) \ge \mathbb{P}(Z = 1) = q$, where $Z$ 
is a Bernoulli random variable with parameter $q$. Moreover, for any integer $k$ such that 
$k/n < q \le (k+1)/n$, we have 

$$P_q(N \ge nq) = P_q(N \ge k+1) \ge P_{k/n}(N \ge k+1) . \tag{6.9}$$

The above inequality can be easily proved by taking the derivative over the interval $(k/n, (k+1)/n]$, 
of the function 

$$q \mapsto \sum_{j=k+1}^{n} \binom{n}{j} q^j (1-q)^{n-j} .$$

We now show that 

$$P_{k/n}(N \ge k+1) \ge P_{(k-1)/n}(N \ge k) , \qquad 2 \le k \le n/2 . \tag{6.10}$$

Let $U_1,\dots,U_n$ be $n$ i.i.d. random variables uniformly distributed on the interval $[0,1]$ and 
denote by $U_{(k)}$ the corresponding $k$th order statistic such that $U_{(1)} \le \dots \le U_{(n)}$. Following Feller 
(1971, Section 7.2), it is not hard to show that 

$$P_{k/n}(N \ge k+1) = \mathbb{P}\Big(U_{(k+1)} \le \frac{k}{n}\Big) = n\binom{n-1}{k}\int_0^{\frac{k}{n}} t^k (1-t)^{n-k-1}\,dt ,$$

and in the same manner, 

$$P_{(k-1)/n}(N \ge k) = \mathbb{P}\Big(U_{(k)} \le \frac{k-1}{n}\Big) = n\binom{n-1}{k-1}\int_0^{\frac{k-1}{n}} t^{k-1} (1-t)^{n-k}\,dt .$$

Note that 

$$\binom{n-1}{k-1} = \binom{n-1}{k}\frac{k}{n-k} ,$$

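so that (6.10) follows if we prove 

$$k\int_0^{\frac{k-1}{n}} t^{k-1}(1-t)^{n-k}\,dt \le (n-k)\int_0^{\frac{k}{n}} t^k(1-t)^{n-k-1}\,dt . \tag{6.11}$$

We can establish the following chain of equivalent inequalities. 

$$\begin{aligned}
& k\int_0^{\frac{k-1}{n}} t^{k-1}(1-t)^{n-k}\,dt \le (n-k)\int_0^{\frac{k}{n}} t^k(1-t)^{n-k-1}\,dt \\
\iff\ & k\int_0^{\frac{k-1}{n}} t^{k-1}(1-t)^{n-k}\,dt \le -\int_0^{\frac{k}{n}} \frac{d}{dt}\big[t^k(1-t)^{n-k}\big]\,dt + k\int_0^{\frac{k}{n}} t^{k-1}(1-t)^{n-k}\,dt \\
\iff\ & \int_0^{\frac{k}{n}} \frac{d}{dt}\big[t^k(1-t)^{n-k}\big]\,dt \le k\int_{\frac{k-1}{n}}^{\frac{k}{n}} t^{k-1}(1-t)^{n-k}\,dt \\
\iff\ & \Big(\frac{k}{n}\Big)^k\Big(1-\frac{k}{n}\Big)^{n-k} \le k\int_{\frac{k-1}{n}}^{\frac{k}{n}} t^{k-1}(1-t)^{n-k}\,dt .
\end{aligned}$$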

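We now study the variations of the function $t \mapsto b(t) = t^{k-1}(1-t)^{n-k}$ on the interval 
$[(k-1)/n, k/n]$. Taking derivative, it is not hard to see that the function $b$ admits a unique local optimum, 
which is a maximum, at $t_0 = \frac{k-1}{n-1}$ and that $t_0 \in ((k-1)/n, k/n)$ because $k < n$. Therefore, the 
function is increasing on $[(k-1)/n, t_0]$ and decreasing on $[t_0, k/n]$. It implies that 

$$\int_{\frac{k-1}{n}}^{\frac{k}{n}} b(t)\,dt \ge \frac{1}{n}\min\Big[b\Big(\frac{k-1}{n}\Big), b\Big(\frac{k}{n}\Big)\Big] .$$

Hence, the proof of (6.11) follows from the following two observations: 

$$\Big(\frac{k}{n}\Big)^k\Big(1-\frac{k}{n}\Big)^{n-k} = \frac{k}{n}\,b\Big(\frac{k}{n}\Big)$$

and 

$$\Big(\frac{k}{n}\Big)^k\Big(1-\frac{k}{n}\Big)^{n-k} \le \frac{k}{n}\,b\Big(\frac{k-1}{n}\Big) .$$

While the first equality above is obvious, the second inequality can be obtained from the following 
chain of equivalent statements: 

$$\begin{aligned}
& \Big(\frac{k}{n}\Big)^{k-1}\Big(\frac{n-k}{n}\Big)^{n-k} \le \Big(\frac{k-1}{n}\Big)^{k-1}\Big(\frac{n-k+1}{n}\Big)^{n-k} \\
\iff\ & \Big(\frac{k}{k-1}\Big)^{k-1}\Big(\frac{n-k}{n-k+1}\Big)^{n-k} \le 1 .
\end{aligned}$$

Since the function $t \mapsto \big(\frac{t+1}{t}\big)^t$ is increasing on $(0,\infty)$, and $k \le n-k+1$, the result follows. 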
To conclude the proof of the Lemma, note that (6.9) and (6.10) imply that for any $q > 1/n$, 

$$P_q(N \ge nq) \ge P_{1/n}(N \ge 2) = 1 - \Big(\frac{n-1}{n}\Big)^{n} - \Big(\frac{n-1}{n}\Big)^{n-1} \ge 1 - \Big(\frac{1}{2}\Big)^{2} - \frac{1}{2} = \frac{1}{4} ,$$

where, in the last inequality, we used the fact that the function 

$$t \mapsto 1 - \Big(\frac{t-1}{t}\Big)^{t} - \Big(\frac{t-1}{t}\Big)^{t-1}$$

is increasing on $[1,\infty)$. 